In this section, I analyze the distribution of key metrics and investigate relationships between variables to understand player behavior before modeling.
2.1 Distribution of Key Metrics
Code
library(ggplot2)
library(gridExtra)

# Histogram of Scores
p1 <- ggplot(game_data_clean, aes(x = score)) +
  geom_histogram(binwidth = 5, fill = "#4e79a7", color = "white", alpha = 0.8) +
  labs(title = "Distribution of Player Scores", x = "Score", y = "Count") +
  theme_minimal()

# Histogram of Game Duration
p2 <- ggplot(game_data_clean, aes(x = game_duration)) +
  geom_histogram(binwidth = 5, fill = "#f28e2b", color = "white", alpha = 0.8) +
  labs(title = "Distribution of Game Duration", x = "Duration (seconds)", y = "Count") +
  theme_minimal()

# Show both histograms side by side
grid.arrange(p1, p2, ncol = 2)
2.2 Death Reason Analysis
Understanding why players fail is crucial for adjusting game difficulty.
Code
# Bar chart for death reasons
game_data_clean %>%
  filter(!is.na(death_reason)) %>%
  count(death_reason, sort = TRUE) %>%
  mutate(death_reason = reorder(death_reason, n)) %>%
  ggplot(aes(x = death_reason, y = n, fill = death_reason)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(title = "Common Causes of Death", x = "Death Reason", y = "Frequency") +
  theme_minimal()
2.3 Correlation Matrix
We check for correlations between numeric variables to identify potential predictors.
Code
library(corrplot)

# Select numeric columns for correlation
num_vars <- game_data_clean %>%
  select(score, game_duration, coins_collected, ufos_shot, bullets_fired, pipes_passed)

# Rows with any missing value are dropped before computing correlations
cor_matrix <- cor(num_vars, use = "complete.obs")

corrplot(cor_matrix, method = "color", type = "upper",
         addCoef.col = "black", tl.col = "black", diag = FALSE,
         title = "Feature Correlation Matrix", mar = c(0, 0, 1, 0))
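To make the effect of `use = "complete.obs"` concrete, here is a minimal sketch on a small made-up data frame. The values and the name `toy` are illustrative assumptions, not taken from `game_data_clean`. It suggests that this option behaves like dropping every row that contains a missing value before computing the correlations.

```r
# Toy data (hypothetical values) with a missing entry in two different rows
toy <- data.frame(
  score           = c(10, 25, 40, NA, 60),
  game_duration   = c(5, 12, 20, 26, 31),
  coins_collected = c(1, 3, NA, 6, 8)
)

# Casewise deletion: any row with an NA is excluded from every correlation
cor_complete <- cor(toy, use = "complete.obs")

# The same matrix, computed by removing incomplete rows explicitly
cor_manual <- cor(na.omit(toy))

print(round(cor_complete, 3))
cat("Rows entering the calculation:", nrow(na.omit(toy)), "of", nrow(toy), "\n")
```

If many rows are lost this way, `use = "pairwise.complete.obs"` is an alternative that keeps more data for each pair of variables, at the cost of each coefficient being based on a different subset of rows.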
3. Bootstrapping Data for Machine Learning
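Since the original dataset is small (~300 rows), we bootstrap the training set, i.e. resample it with replacement, to create a larger dataset of 10,000 rows for model training. Resampling repeats observed rows rather than adding new information, so the model still learns only from the original training records. We reserve the last 50 records, ordered by start_time, as a strict holdout test set.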
Code
set.seed(123)

# 1. Split real data into Train (first 250) and Test (last 50)
# Sorting by start_time ensures we respect temporal order
game_sorted <- game_data_clean %>% arrange(start_time)

# Cap the training slice so it can never overlap the last 50 rows,
# even if the dataset has fewer than 300 records
train_base <- head(game_sorted, min(250, nrow(game_sorted) - 50))
test_holdout <- tail(game_sorted, 50)

# 2. Bootstrap the training data to 10,000 samples
# Sampling with replacement allows us to simulate a larger dataset based on observed patterns
bootstrap_size <- 10000
train_bootstrapped <- train_base %>%
  slice_sample(n = bootstrap_size, replace = TRUE) %>%
  mutate(is_synthetic = TRUE) # Flag for tracking

# Combine for verification (optional) but we will train on 'train_bootstrapped'
cat("Original Train Size:", nrow(train_base), "\n")
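As a rough illustration of what sampling with replacement does, the sketch below resamples a made-up 250-row data frame to 10,000 rows using base R. The data frame `toy_train` and its columns are hypothetical stand-ins for `train_base`, and base `sample()` is swapped in for `slice_sample()` so the example runs without extra packages.

```r
set.seed(123)

# Hypothetical stand-in for the 250-row training base
toy_train <- data.frame(
  id    = 1:250,
  score = rpois(250, lambda = 20)
)

# Draw 10,000 row indices with replacement, then index the data frame
boot_idx <- sample(nrow(toy_train), size = 10000, replace = TRUE)
toy_boot <- toy_train[boot_idx, ]

cat("Bootstrapped rows:", nrow(toy_boot), "\n")
cat("Distinct original rows represented:", length(unique(toy_boot$id)), "\n")
cat("Mean score, original vs bootstrapped:",
    round(mean(toy_train$score), 2), "vs", round(mean(toy_boot$score), 2), "\n")
```

The bootstrapped set can contain at most 250 distinct rows, so it is larger without holding more observations. Summary statistics should stay close to those of the original training base.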
We use a Random Forest model to predict the final score based on gameplay metrics.
Code
library(randomForest)
library(caret)

# Define features
features <- c("coins_collected", "ufos_shot", "bullets_fired", "game_duration", "pipes_passed")

# Train Random Forest on Bootstrapped Data
rf_model_score <- randomForest(
  as.formula(paste("score ~", paste(features, collapse = "+"))),
  data = train_bootstrapped,
  ntree = 100,
  importance = TRUE
)

# Predict on Holdout Test Set
predictions_rf <- predict(rf_model_score, newdata = test_holdout)

# Evaluate Performance (RMSE & R-squared)
rmse_val <- RMSE(predictions_rf, test_holdout$score)
r2_val <- R2(predictions_rf, test_holdout$score)

cat("Random Forest Performance on Holdout Set:\n")
Random Forest Performance on Holdout Set:
Code
cat("RMSE:", round(rmse_val, 2), "\n")
RMSE: 1.8
Code
cat("R-Squared:", round(r2_val, 4), "\n")
R-Squared: 0.9818
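For readers who want to see what these two metrics measure, here is a small base-R sketch that computes them by hand. The vectors `pred` and `obs` are hypothetical values, not the holdout predictions. To my understanding, caret's `R2()` defaults to the squared correlation between predictions and observations, which is the first definition shown. The second is the common "1 - SSE/SST" form, included for comparison.

```r
# Hypothetical predictions and observed scores (not taken from the holdout set)
pred <- c(10, 12, 15, 20)
obs  <- c(11, 12, 14, 22)

# RMSE: square root of the mean squared prediction error
rmse_manual <- sqrt(mean((pred - obs)^2))

# R-squared as the squared Pearson correlation between predictions and observations
r2_corr <- cor(pred, obs)^2

# Alternative definition: 1 - (sum of squared errors / total sum of squares)
r2_traditional <- 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)

cat("RMSE:", round(rmse_manual, 4), "\n")
cat("R-squared (squared correlation):", round(r2_corr, 4), "\n")
cat("R-squared (1 - SSE/SST):", round(r2_traditional, 4), "\n")
```

The two R-squared definitions can differ when predictions are systematically biased, so it helps to state which one a reported value refers to.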
Code
# Variable Importance Plot
varImpPlot(rf_model_score, main = "Feature Importance for Score Prediction")
5.2 Survival Analysis (Logistic Regression)
We predict whether a player will survive past a specific “expert” threshold (e.g., 30 seconds). This is a binary classification problem.
Code
# Define 'Survival' as lasting longer than 30 seconds
threshold <- 30

train_bootstrapped <- train_bootstrapped %>%
  mutate(survived_expert = factor(ifelse(game_duration > threshold, 1, 0), levels = c(0, 1)))
test_holdout <- test_holdout %>%
  mutate(survived_expert = factor(ifelse(game_duration > threshold, 1, 0), levels = c(0, 1)))

# Train Logistic Regression
# We remove variables that directly determine duration/score to prevent data leakage, focusing on behavioral counts
log_model <- glm(survived_expert ~ bullets_fired + ufos_shot + coins_collected,
                 data = train_bootstrapped, family = "binomial")

# Predict probabilities on Test Set
probs_survival <- predict(log_model, newdata = test_holdout, type = "response")
preds_survival <- ifelse(probs_survival > 0.5, 1, 0)

# Confusion Matrix
# Explicit factor levels keep this call working even if only one class is predicted
conf_matrix <- confusionMatrix(factor(preds_survival, levels = c(0, 1)), test_holdout$survived_expert)

cat("Logistic Regression Accuracy:", round(conf_matrix$overall['Accuracy'], 4), "\n")
Logistic Regression Accuracy: 0.96
Code
print(conf_matrix$table)
          Reference
Prediction  0  1
         0 47  1
         1  1  1
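The counts in the confusion matrix above can be turned into a few summary figures by hand. The sketch below rebuilds the table as a plain matrix and computes accuracy, the accuracy of always predicting class 0, and the share of true experts that the model identifies. The variable names are illustrative. Only the four counts come from the output above.

```r
# Counts copied from the confusion matrix above (rows = prediction, columns = reference)
cm <- matrix(c(47, 1,
               1,  1),
             nrow = 2, byrow = TRUE,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

# Accuracy: correct predictions (diagonal) over all holdout records
accuracy <- sum(diag(cm)) / sum(cm)

# Baseline: accuracy obtained by always predicting the majority reference class
baseline <- max(colSums(cm)) / sum(cm)

# Sensitivity for class "1": experts correctly identified / all true experts
sensitivity_expert <- cm["1", "1"] / sum(cm[, "1"])

cat("Accuracy:", accuracy, "\n")
cat("Majority-class baseline:", baseline, "\n")
cat("Sensitivity for class 1:", sensitivity_expert, "\n")
```

With these counts the accuracy matches the majority-class baseline (48 of 50 in both cases), and only 2 holdout records belong to the expert class, so accuracy alone says little about how well experts are detected.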
6. Business Insights & Recommendations
Based on the analysis above, we derive the following actionable insights:
6.1. Difficulty Balancing:
Observation: The death_reason analysis highlights the most common obstacles (e.g., pipes vs. enemies). If ‘pipe’ collisions are disproportionately high early in the game, the initial difficulty curve may be too steep.
Recommendation: Adjust the gap size or spawn rate of the leading cause of death in the first 10 seconds of gameplay to improve retention.
6.2. Player Segmentation Strategy:
Observation: K-Means clustering identified distinct groups. (Refer to cluster table: e.g., High-duration/low-coin collectors vs. Aggressive shooters).
Recommendation: Introduce targeted rewards.
For ‘Survivors’ (High duration, low action): Introduce time-based achievements.
For ‘Shooters’ (High bullets/UFOs): Offer weapon skins or visual upgrades for combat milestones.
6.3. Predictive Engagement:
Observation: The Random Forest model shows that specific actions (like coins_collected or ufos_shot) are strong predictors of high scores.
Recommendation: Create a tutorial or “Daily Mission” focusing on these high-value actions to teach new players how to achieve higher scores effectively.
6.4. Monetization Opportunities:
Observation: Players who survive past the 30-second threshold (the “expert” class in the Logistic Regression) stay in a session longer, which we treat as a proxy for higher engagement.
Recommendation: Trigger “Continue?” ads or special offers only after a player has demonstrated this “expert” survival trait, as they are more invested in the session than a player who dies instantly.